AUTOMATED VIDEO PRODUCTION SYSTEM AND METHOD USING EXPERT VIDEO PRODUCTION RULES FOR ONLINE PUBLISHING OF LECTURES

Background of Invention 

[0001] 1. Field of the Invention

[0002] The present invention relates in general to automated video production and more
particularly to an automated system and method for producing videos using expert video
production rules.

[0003] 2. Related Art

[0004] A lecture (such as, for example, a meeting, a talk, a seminar, a presentation and classroom 
instruction) is an important tool whereby knowledge transfer, teaching and learning can occur. 
A lecture, which can be any setting whereby an exchange of information occurs, may include at 
least one lecturer and an audience. Universities and many corporations, inspired by both the 
increased emphasis on life-long learning and the rapid pace of technological change, are 
offering an ever-increasing selection of lectures to teach, train and inform students and 
employees. In order to accommodate all persons interested in a lecture and to manage the 
inevitable time conflicts that occur, many of these lectures are made accessible "online" over a 
computer network. Viewing a lecture online allows a person to remotely view the lecture either 
while the actual lecture is occurring ("live") or at a later time ("on-demand"). 

[0005] In order to facilitate viewing of a lecture both live and on-demand, the lecture must be
made available (or published) online. Online publishing of a lecture enables a person to view
the lecture at a time and location that is convenient for the person. Online publishing of
lectures is becoming more feasible and popular due to continuous improvements in computer 
network infrastructure and streaming-media technologies. The popularity of publishing lectures
online is gaining momentum slowly, however, and the fact remains that the majority of lectures 
that take place are not recorded or made available online. Two key barriers to online publishing 
are the cost of equipping lecture rooms with the equipment (such as cameras) needed to 
capture the lecture and the labor costs associated with having people produce the lecture video. 
Equipment cost is a one-time cost and tends to become less expensive as market demand 
increases. Labor cost for video production, however, is a recurring cost and is one of the main
obstacles to the online publishing of lectures.

As computer technology continues to advance, one alternative to hiring a human video 
production team is to construct a fully automated video production system that replaces human 
operators with machines. Automated video production systems and methods that are able to 
provide high-quality recordings are highly desirable because labor costs associated with video 
production are greatly reduced. Because labor costs are a major obstacle to recording lectures, 
reducing these labor costs by using high-quality automated video production techniques 
instead of a human video production team reduces at least some obstacles to online publishing 
of lectures. 

There are a few existing automated video production systems and research prototypes. 
However, each of these systems has one or more of the following limitations:

■ Single camera setup: One disadvantage with a single camera setup is that there are long 
switching times between camera views. In other words, when switching between camera views, 
the single camera must be repositioned and this repositioning takes time. This causes a delay 
in the transition from a current camera view to a new camera view. Another disadvantage is that 
there is no backup camera in case the tracking device of the single camera fails, which can 
happen in a fully automated system. 

■ Using invasive sensors to track lecturers: These sensors must be worn by the lecturer and 
may be bothersome to the lecturer and interfere with the lecturer's freedom of movement. 

■ Simplistic directing rules: One disadvantage of having simplistic directing rules is that 
they do not automatically produce lecture videos that are of the same high quality as lecture 
videos produced by a professional human video production team. For example, many existing
systems use directing rules that state camera views should be switched when a triggering event 
of a transition from one slide to another occurs. 

[0011] ■ Lack of an audience-tracking camera: Human video production teams know, and studies
have shown, that the audience is an important and useful part of the lecture experience, so they
include camera views of the audience when capturing the lecture. An audience camera can
focus on an audience member who asks questions and can provide random audience shots to 
make the lecture more enjoyable for a viewer to watch. Leaving out camera views of the 
audience and those audience members who may ask questions by not providing an audience- 
tracking camera severely degrades the quality of the final lecture video. 

[0012] Accordingly, there exists a need for an automated video production technique that is
automated to alleviate human labor costs associated with producing a lecture video. What is
needed is an automated video production technique that captures lectures in a professional and
high-quality manner using rules similar to those used by professional human directors and
cinematographers. What is further needed is an automated video production system that
includes multiple cameras to enable fast switching of camera views, to allow a variety of camera
views and to provide backup cameras in the case of camera failure. Moreover, what is needed is
an automated video production technique that uses an audience-tracking camera to provide
audience camera views and to make the lecture video more enjoyable for a viewer to watch.
Moreover, what is needed is an automated video production system and method that tracks a 
lecturer in a lecture without the need for the lecturer to wear bothersome and restricting 
sensors, such that the lecturer remains unaware of the tracking. This type of automated video 
production technique takes into account the desires of the viewing audience and caters to these 
desires to make the production and viewing of lectures online an enjoyable experience while at 
the same time enhancing the learning process. 

Summary of Invention 

[0013] To overcome the limitations in the prior art as described above and other limitations that
will become apparent upon reading and understanding the present specification, the present 
invention includes an automated video production system and method that uses expert video 
production rules. The video production rules are obtained by determining the rules used by 
human experts in the video production field (such as professional human video producers) and 
implementing the rules that are achievable given the available technology and at a reasonable
cost. These video production rules may include, for example, rules on how the cameras should 
be positioned, rules on how a lecturer should be framed, and rules on how multiple camera 
views should be edited to generate a current camera view (or output camera view) seen by a 
user. 

[0014] In general, the automated video production system of the present invention includes a
camera system having at least one camera for capturing the lecture and a cinematographer for
controlling the camera. The camera system preferably includes a plurality of cameras that
provide multiple camera views. Alternatively, however, the system may include a single
high-resolution panoramic camera having a wide angle field-of-view (such as an approximately
360-degree field-of-view). When this type of single camera is used, camera views are constructed
by focusing on a portion of the entire field of view. For example, if a lecturer is being tracked then
only the portion of the 360-degree image that includes the lecturer is displayed for a camera
view. Alternatively, the system may include a plurality of cameras arranged such that an
approximately 360-degree field of view is provided. The cinematographer provides control over
one or more of the cameras in the system. Preferably, the cinematographer is a virtual
cinematographer that is a software module incorporating at least some of the video production
rules described below. Alternatively, the cinematographer may be an analog component or even
a human being that is controlling the camera.
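
As a rough illustration of how a camera view might be constructed from a single panoramic camera, the sketch below selects only the columns of a 360-degree image that surround a tracked subject. The function name `crop_view`, the representation of the image as a list of pixel rows, and the parameters `center_deg` and `width_deg` are assumptions made for this example and are not taken from the specification.

```python
def crop_view(panorama, center_deg, width_deg):
    """Return the part of a 360-degree panorama centered on center_deg.

    panorama is a list of rows; each row is a list of pixels whose columns
    are assumed to span 0..360 degrees evenly.
    """
    if not panorama or not panorama[0]:
        return []
    n_cols = len(panorama[0])
    # Convert the angular window into a column count (at least one column).
    view_cols = max(1, min(n_cols, round(n_cols * width_deg / 360.0)))
    center_col = int((center_deg % 360.0) / 360.0 * n_cols)
    start = center_col - view_cols // 2
    # Modular indexing lets the view straddle the seam of the panorama.
    return [[row[(start + i) % n_cols] for i in range(view_cols)]
            for row in panorama]
```

In this sketch a subject tracked at 0 degrees yields a view that wraps around the image seam, which is one reason a panoramic source is convenient for tracking a subject who moves freely.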

[0015] The camera system preferably includes one or more audience-tracking cameras for
providing audience images and tracking audience members. In addition, the camera system
may include one or more lecturer-tracking cameras that non-invasively track a lecturer without 
the need for the lecturer to wear a sensor. The camera system may also lack a lecturer-tracking 
camera. The camera system may also include zero, one or more overview cameras that provide
an overview camera view of the lecture environment. Camera views are framed using the expert 
video production rules. For instance, the audience-tracking camera is used to show a camera 
view of the audience even if no one in the audience is talking, in order to follow an expert video 
production rule that the audience should be shown occasionally even if there is no question. 
Tracking of a subject is performed using a history-based, reduced-motion tracker. This tracker 
tracks a subject and sets a camera shot (zoom and pan) using the subject's movement history. 
Once the camera view is determined and set, it is fixed to reduce any distracting and 
unnecessary camera movements. 
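
One way to picture a history-based, reduced-motion tracker is sketched below. It is only an illustration under stated assumptions: the class name `ReducedMotionTracker`, the use of a single pan angle in degrees, and the parameters `history_len`, `margin_deg` and `min_fov_deg` are hypothetical and do not come from the specification. The shot (pan and field of view) is derived from recent subject positions and is then held fixed until the subject leaves the framed region.

```python
from collections import deque


class ReducedMotionTracker:
    """Frames a subject from its movement history and then holds the shot."""

    def __init__(self, history_len=50, margin_deg=5.0, min_fov_deg=10.0):
        self.history = deque(maxlen=history_len)  # recent subject angles
        self.margin_deg = margin_deg              # padding on each side
        self.min_fov_deg = min_fov_deg            # tightest allowed zoom
        self.shot = None                          # (pan_deg, fov_deg)

    def _frame_from_history(self):
        lo, hi = min(self.history), max(self.history)
        pan = (lo + hi) / 2.0
        fov = max(self.min_fov_deg, (hi - lo) + 2 * self.margin_deg)
        return (pan, fov)

    def update(self, subject_deg):
        """Record a new subject position and return the current shot."""
        self.history.append(subject_deg)
        if self.shot is None:
            self.shot = self._frame_from_history()
            return self.shot
        pan, fov = self.shot
        # Move the camera only when the subject leaves the framed region.
        if abs(subject_deg - pan) > fov / 2.0:
            self.shot = self._frame_from_history()
        return self.shot
```

Because the new shot is sized from the range of recent positions, a lecturer who paces widely would be given a wider shot, which in turn reduces how often the camera has to move.
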

[0016] The automated video production system also includes a virtual director module for
selecting and determining which of the multiple camera views is the current camera view to be
viewed by a user. Each of the cinematographers reports a camera status to the virtual director 
module. The virtual director module has two components: an event generator that generates a 
triggering event which initiates switching (or changing) from one camera view to another, and a 
finite state machine (FSM) to decide to which camera view to switch. Based on each reported 
status, the virtual director uses probabilistic rules and the expert video production rules to 
determine the current camera view. The virtual director determines the current camera view by 
deciding how long a camera should be used, when the current camera view should be changed 
and which camera view should be the new current camera view. Unlike other types of 
automated video production techniques that merely use a simplistic fixed rotation to select the 
current camera view, the present invention uses a probabilistic transition matrix constrained by
expert video production rules to select the current camera view. Thus, the camera selected as
the next current camera view is a weighted random choice. This enables the automated video
production system and method of the present invention to produce a more aesthetically
pleasing and professional video production without the expense of a human video production
team.
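
The weighted random choice described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the view names, the numeric weights in `TRANSITIONS`, and the function `next_view` are all hypothetical. The table has no entry for a view switching to itself, and views whose cinematographer reports that the camera is not ready are filtered out, as simple stand-ins for constraints imposed by expert rules.

```python
import random

# Hypothetical transition weights: TRANSITIONS[current][candidate].
TRANSITIONS = {
    "lecturer": {"audience": 0.3, "overview": 0.2, "slides": 0.5},
    "audience": {"lecturer": 0.7, "overview": 0.1, "slides": 0.2},
    "overview": {"lecturer": 0.6, "audience": 0.1, "slides": 0.3},
    "slides":   {"lecturer": 0.7, "audience": 0.1, "overview": 0.2},
}


def next_view(current, ready, rng=random):
    """Pick the next view as a weighted random choice among ready cameras.

    ready maps a view name to the status reported for that camera.
    """
    candidates = {view: weight
                  for view, weight in TRANSITIONS[current].items()
                  if weight > 0 and ready.get(view, False)}
    if not candidates:
        return current  # nothing usable: stay on the current view
    views = list(candidates)
    weights = [candidates[view] for view in views]
    return rng.choices(views, weights=weights, k=1)[0]
```

A triggering event (for example, the current view reaching its maximum duration) would call `next_view`; unlike a fixed rotation, repeated calls from the same state need not yield the same sequence of views.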

[0017] Other aspects and advantages of the present invention as well as a more complete
understanding thereof will become apparent from the following detailed description, taken in
conjunction with the accompanying drawings, illustrating by way of example the principles of
the invention. Moreover, it is intended that the scope of the invention be limited by the claims 
and not by the preceding summary or the following detailed description. 

Brief Description of Drawings 

[0018] The present invention can be further understood by reference to the following description
and attached drawings that illustrate preferred embodiments of the invention. Other features
and advantages will be apparent from the following detailed description of the invention, taken
in conjunction with the accompanying drawings, which illustrate, by way of example, the
principles of the present invention. 

[0019] Referring now to the drawings in which like reference numbers represent corresponding
parts throughout: 

[0020] FIG. 1 is an overall block diagram illustrating an implementation of the automated video
production system of the present invention for capturing a lecture and is provided for 
illustrative purposes only. 

[0021] FIG. 2 is a general block diagram illustrating a computing platform as shown in FIG. 1 that 
preferably may be used to carry out the present invention. 

[0022] FIG. 3 is a general block/flow diagram illustrating the major components of the present 
invention and their interaction. 

[0023] FIG. 4 is a general block diagram illustrating a preferred implementation of the major 
components of the present invention. 

[0024] FIG. 5 is a geometric diagram illustrating the sound source localization technique of the
present invention.

[0025] FIG. 6 is a detailed block diagram illustrating the interaction between the audience-tracking
cinematographer, the lecturer-tracking cinematographer, the overview camera
cinematographer, the virtual director module and the mixer shown in FIG. 4.

[0026] FIG. 7 illustrates a preferred embodiment of the probabilistic location transition module
shown in FIG. 6.

[0027] FIG. 8 illustrates a preferred embodiment of the remote audience graphical user interface
shown in FIG. 4. 

Detailed Description 

[0028] In the following description of the invention, reference is made to the accompanying
drawings, which form a part thereof, and in which is shown by way of illustration a specific
example whereby the invention may be practiced. It is to be understood that other 
embodiments may be utilized and structural changes may be made without departing from the 
scope of the present invention. 

[0029] I. General Overview 

[0030] The present invention includes an automated video production system and method useful
for producing videos of lectures for live broadcasting or on-demand viewing. The present
invention is modeled after a professional human video production team. The system of the
present invention includes a camera system having one or more cameras for capturing a
lecture and a virtual director, and may include one or more virtual cinematographers. A virtual
cinematographer can be assigned to a camera of the camera system and is used to control the 
camera in capturing the lecture by determining which camera view to obtain and how to track a 
subject (such as a lecturer). The virtual director receives data from the virtual cinematographer 
and determines which of the multiple camera views obtained by the camera system is the 
current (or output) camera view seen by a user. The edited lecture video is then encoded for 
both live broadcasting and on-demand viewing. Both the virtual cinematographer and the 
virtual director are controlled in accordance with expert video production rules. As explained in 
detail below, these rules provide a framework whereby the virtual cinematographer and the
virtual director can emulate a professional human video production team.

[0031] FIG. 1 is an overall block diagram illustrating an implementation of the automated video
production system 100 of the present invention for capturing a lecture and is provided for
illustrative purposes only. It should be noted that the automated video production system 100
shown in FIG. 1 is only one example of numerous ways in which the present invention may be
implemented. In general, the automated video production system 100 records the lecture, edits
the lecture while recording and transmits the edited lecture video to a remote location. More
specifically, a lecture is presented in a lecture room 105 by a lecturer 110. The lecturer 110
typically uses a podium 115 within the lecture room 105 for holding aids such as notes and a
microphone (not shown). The lecture room 105 also includes an area for an audience 120 to
view the lecture.

[0032] In the implementation shown in FIG. 1, the capture of the lecture is achieved using a
plurality of cameras and a plurality of cinematographers that are used to control some of the
cameras. The plurality of cameras are positioned around the lecture room 105 such that views
from each of the cameras differ significantly from each other. As discussed in detail below, this
is to follow an expert video production rule that when transitioning from one camera shot to
another the view should be significantly different. As shown in FIG. 1, the automated video
production system 100 includes an audience-tracking camera 125 that is positioned in the
lecture room 105 to provide a camera view of members of the audience 120. Moreover, as
explained in detail below, the audience-tracking camera 125 uses a sensor device 127 (such as
a microphone array) to track specific members of the audience 120, such as those audience
members asking questions. A lecturer-tracking camera 130 is included in the automated video
production system 100 and positioned in the lecture room 105 to provide camera views of the
lecturer 110. As explained in detail below, the lecturer-tracking camera 130 of the present
invention does not require that the lecturer 110 wear any tracking equipment (such as an
infrared (IR) emitter or magnetic emitter) so that the lecturer 110 is not bothered by the need to
wear the extra tracking equipment.

[0033] The present invention also includes an overview camera 135 that allows the automated
video production system 100 to follow the expert video production rule that the lecture
environment should be established first. The overview camera 135 is also used as a backup
camera, so that if one of the tracking cameras fails the camera view from the overview camera
may be substituted. In this implementation, the overview camera 135 is static, although other
implementations may dictate that the overview camera be able to move. Moreover, the overview
camera also may include a plurality of cameras, both static and dynamic. The automated video
production system and method of the present invention starts with an overhead camera shot to
establish the lecture environment. As part of the lecture the lecturer 110 may use sensory aids
to augment the lecture. For example, as shown in FIG. 1 the lecturer 110 may use slides
projected onto a screen 140. These slides are captured using a visual presentation tracking
camera 145 that is capable of providing camera views of the visual presentations used in the
lecture, which in this case are slides.

[0034] The automated video production system 100 includes an automated camera management
module 150 that includes virtual cinematographers, a virtual director, a mixer and an encoder
(all discussed in detail below) for controlling the capture of the lecture, editing the lecture and
providing a final output of the recorded lecture. As shown in FIG. 1, in this implementation the
automated video production system 100 resides on a single computer platform 155, assuming
the computing power of the single computer platform 155 is enough. However, it should be
noted that other implementations are possible, such as having each virtual cinematographer
and virtual director reside on separate computer platforms. Moreover, even though FIG. 1
illustrates various virtual cinematographers and the virtual director as separate modules, these
modules may actually be resident on a single computer running different threads. In this
implementation, the screen 140, visual presentation tracking camera 145, audience-tracking
camera 125, lecturer-tracking camera 130 and overview camera 135 are connected to the
automated camera management module 150 to facilitate production control. The finished
lecture video is presented to a remote audience 160 by transmission over a communication
channel 165 (such as a network). The remote audience 160 interfaces with the lecture video using
a graphical user interface 170 residing on a remote computer platform 175.



[0035] In a preferred embodiment, the computer platform 155 and the remote computer platform
175 are computing machines (or devices) in a computing environment (such as a client/server
networking environment). FIG. 2 is a general block diagram illustrating a computing platform as
shown in FIG. 1 that preferably may be used to carry out the present invention. FIG. 2 and the
following discussion are intended to provide a brief, general description of a suitable
computing environment in which the automated video production system and method of the
present invention may be implemented. Although not required, the present invention will be
described in the general context of computer-executable instructions (such as program
modules) being executed by a computer. Generally, program modules include routines,
programs, objects, components, data structures, etc. that perform particular tasks or
implement particular abstract data types. Moreover, those skilled in the art will appreciate that
the invention may be practiced with a variety of computer system configurations, including
personal computers, server computers, hand-held devices, multiprocessor systems,
microprocessor-based or programmable consumer electronics, network PCs, minicomputers,
mainframe computers, and the like. The invention may also be practiced in distributed
computing environments where tasks are performed by remote processing devices that are
linked through a communications network. In a distributed computing environment, program
modules may be located on both local and remote computer storage media including memory
storage devices.

[0036] Referring to FIG. 2, an exemplary system for implementing the present invention includes a 
general-purpose computing device in the form of a conventional personal computer 200, 
including a processing unit 202, a system memory 204, and a system bus 206 that couples 
various system components including the system memory 204 to the processing unit 202. The 
system bus 206 may be any of several types of bus structures including a memory bus or 
memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. 
The system memory includes read only memory (ROM) 210 and random access memory (RAM) 
212. A basic input/output system (BIOS) 214, containing the basic routines that help to transfer 
information between elements within the personal computer 200, such as during start-up, is 
stored in ROM 210. The personal computer 200 further includes a hard disk drive 216 for 
reading from and writing to a hard disk (not shown), a magnetic disk drive 218 for reading from
or writing to a removable magnetic disk 220, and an optical disk drive 222 for reading from or
writing to a removable optical disk 224 (such as a CD-ROM or other optical media). The hard
disk drive 216, magnetic disk drive 218 and optical disk drive 222 are connected to the system
bus 206 by a hard disk drive interface 226, a magnetic disk drive interface 228 and an optical 
disk drive interface 230, respectively. The drives and their associated computer-readable media 
provide nonvolatile storage of computer readable instructions, data structures, program 
modules and other data for the personal computer 200. 

[0037] Although the exemplary environment described herein employs a hard disk, a removable
magnetic disk 220 and a removable optical disk 224, it should be appreciated by those skilled
in the art that other types of computer readable media that can store data that is accessible by
a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli
cartridges, random access memories (RAMs), read-only memories (ROMs), and the like, may
also be used in the exemplary operating environment.

[0038] A number of program modules may be stored on the hard disk, magnetic disk 220, optical
disk 224, ROM 210 or RAM 212, including an operating system 232, one or more application
programs 234, other program modules 236 and program data 238. A user (not shown) may
enter commands and information into the personal computer 200 through input devices such
as a keyboard 240 and a pointing device 242. In addition, other input devices (not shown) may 
be connected to the personal computer 200 including, for example, a camera, a microphone, a 
joystick, a game pad, a satellite dish, a scanner, and the like. These other input devices are 
often connected to the processing unit 202 through a serial port interface 244 that is coupled 
to the system bus 206, but may be connected by other interfaces, such as a parallel port, a 
game port or a universal serial bus (USB). A monitor 246 or other type of display device is also 
connected to the system bus 206 via an interface, such as a video adapter 248. In addition to 
the monitor 246, personal computers typically include other peripheral output devices (not 
shown), such as speakers and printers. 



[0039] The personal computer 200 may operate in a networked environment using logical
connections to one or more remote computers, such as a remote computer 250. The remote 
computer 250 may be another personal computer, a server, a router, a network PC, a peer 
device or other common network node, and typically includes many or all of the elements 
described above relative to the personal computer 200, although only a memory storage device
252 has been illustrated in FIG. 2. The logical connections depicted in FIG. 2 include a local area 
network (LAN) 254 and a wide area network (WAN) 256. Such networking environments are 
commonplace in offices, enterprise-wide computer networks, intranets and the Internet. 

[0040] When used in a LAN networking environment, the personal computer 200 is connected to 
the local network 254 through a network interface or adapter 258. When used in a WAN 
networking environment, the personal computer 200 typically includes a modem 260 or other 
means for establishing communications over the wide area network 256, such as the Internet. 
The modem 260, which may be internal or external, is connected to the system bus 206 via the 
serial port interface 244. In a networked environment, program modules depicted relative to 
the personal computer 200, or portions thereof, may be stored in the remote memory storage
device 252. It will be appreciated that the network connections shown are exemplary and other
means of establishing a communications link between the computers may be used.

[0041] II. Components of the Invention

[0042] The present invention includes an automated video production system and method for
online publishing of lectures. In order to produce a high-quality lecture video, human operators
need to perform many tasks including tracking a moving lecturer, locating a talking audience
member, or showing presentation slides. It takes many years of training and experience for a
human operator to efficiently perform these tasks. Consequently, high-quality videos are 
usually produced by a human video production team that includes a director and multiple 
cinematographers. Distributing the video production tasks to different crewmembers and 
creating final video products through collaboration make the video production process more 
efficient and smooth. This strategy is a good model reference for a computer-based automated 
video production system. Inspired by this idea, the automated video production system and 
method is organized according to the structure of a human video production team. 

[0043] In general, the components of the present invention include a camera system containing at
least one camera, a cinematographer and a virtual director. The camera system is used to 
capture the lecture and may include a single camera or a plurality of cameras. The 
cinematographer is connected to one or more cameras and is used to control the camera views 
of the camera. In addition, the cinematographer may be used to track a subject (such as a 
lecturer). The cinematographer may be digital (such as a software module), analog, or may even 
be a human operator. The virtual director is a software module that receives the multiple
camera views from the camera system and determines which of these multiple camera views to 
use as a current camera view. 



[0044] FIG. 3 is a general block/flow diagram illustrating the major components of the present
invention and their interaction. The solid lines indicate video data and the dashed lines indicate
control signals and status signals. It should be noted that the automated video production 
system of the present invention is a modular system and many of the components shown in 
FIG. 3 may be added to or subtracted from the system as desired. Typically, which components 
are included and the number of those components included in the system depend on the size 
and layout of the lecture room. 

[0045] As shown in FIG. 3, a camera system 305 may be used to capture a lecture. The camera
system 305 may include a single camera (such as a panoramic camera providing an
approximately 360-degree field-of-view) or a plurality of cameras. This plurality of cameras
may include a lecturer-tracking camera 130 for tracking a lecturer, an audience-tracking camera
125 for tracking audience members, an overview camera 135 for providing overview camera
views of the lecture environment, and a visual presentation tracking camera 145 for tracking
any visual presentation (such as video or slides) used during the lecture. A plurality of
cinematographers also are available to control each of the cameras. These cinematographers
include a lecturer-tracking cinematographer 300, an audience-tracking cinematographer 310,
an overview-tracking cinematographer 320 and a visual presentation tracking cinematographer
330. A virtual director 340 is used to manage the video output of each camera. A mixer 350 is 
controlled by the virtual director 340 and mixes together the video inputs from each of the 
cameras as directed by the virtual director 340. The mixer 350 may be an analog mixer or may 
be a digital mixer that includes a software module that performs the mixing functions. An 
encoder 360 encodes the final product and sends as output a video of the lecture. 

[0046] The system of the present invention is modeled after a human video production team. This
model includes using a two-level structure to simulate the roles performed by human 
cinematographers and directors. At the lower level, the virtual cinematographers of the present 
invention are assigned to different cameras to perform basic video shooting tasks, such as 
tracking a lecturer or locating a talking audience member. Each virtual cinematographer has the 
following four components: (1) sensors that sense the world, just like a human 
cinematographer has eyes and ears; (2) cameras that capture the scenes, just like human
cinematographers have their video cameras; (3) framing rules that control the camera 
operation, just like human cinematographers have their knowledge of how to frame a camera 
view; and (4) virtual cinematographer/virtual director communication rules, just like human 
cinematographers need to communicate with their director. At the upper level, the virtual 
director of the present invention collects status and events information from each virtual 
cinematographer and controls the mixer to decide which of the multiple camera views from the 
camera system should be the current camera view seen by the user. The edited lecture video is 
then encoded for both live broadcasting and on-demand viewing. 

FIG. 4 is a general block diagram illustrating a preferred implementation of the major 
components of the present invention. The camera system 305 captures the lecture and provides 
multiple camera views of the lecture. The camera system 305 may include a variety of cameras. 
In particular, the camera system 305 may include the lecturer-tracking camera 130 for tracking
a lecturer, the audience-tracking camera 125 for providing audience images and tracking
desired audience members, the overview camera 135 for providing an overview of the lecture
environment, and the visual presentation tracking camera 145 for tracking and providing
camera views of visual presentations used in the lecture.

In addition to using the camera system 305 to capture the lecture, the automated video 
production system 100 may also capture any lecture-based sensory aids 410 that are used 
during the lecture. These lecture-based sensory aids 410 include visual aids 420, audio aids 
430 (such as audio recordings), and other sensory aids 440 (such as motion pictures). These 
lecture-based sensory aids 410 typically are used by a lecturer during a lecture to make the
lecture more interesting and to emphasize a particular point. The camera system 305 and the
lecture-based sensory aids 410 are in communication with the automated camera management
module 150 through communication links 450, 460.

The automated camera management module 150 includes the lecturer-tracking
cinematographer 300, which is in communication with and provides control of the
lecturer-tracking camera 130. Further, the automated camera management
module 150 includes the audience-tracking cinematographer 310, which is in communication
with and controls the audience-tracking camera 125. Also included
in the automated camera management module 150 is a virtual director 340 for receiving input
from the lecturer-tracking cinematographer 300, the audience-tracking cinematographer 310,
the overview camera 135 and the visual presentation tracking camera 145. One function of the
virtual director 340 is to choose a current camera view that is viewed by a user from the
multiple camera views provided by the camera system 305. In addition, the virtual director 340
receives input from the lecture-based sensory aids 410 (if they are being used) and determines
how to incorporate them into the current camera view. For example, visual aids 420 such as
slides may be incorporated into the current camera view to produce a final video output. The
virtual director 340 controls the mixer 350 by instructing the mixer 350 which of the multiple
camera views from the camera system 305 to use as the current camera view.

[0050] In general, data from the camera system 305 and lecture-based sensory aids 410 are
communicated to the mixer 350 and control signals from the virtual director 340 are
communicated to the lecture-based sensory aids 410, the overview camera 135, the visual
presentation tracking camera 145 and the mixer 350. In addition, control signals are
communicated between the virtual director 340, the lecturer-tracking cinematographer 300
and the lecturer-tracking camera 130 as well as the audience-tracking cinematographer 310
and the audience-tracking camera 125. Once the virtual director 340 determines which of the
multiple camera views from the camera system 305 to use as the current camera view, the
selected camera view (along with any other data from the lecture-based sensory aids 410) is
sent from the mixer 350 to the encoder 360 for encoding. The encoded lecture video is sent
over a communication channel 470 for viewing by a remote audience. The remote audience
views and interacts with the lecture video using a remote audience graphical user interface 170.

[0051] III. Component Details 
[0052] 

The automated video production system of the present invention includes one or more
virtual modules, each having a specific function. These virtual modules may be included or left
out of the system depending on the desired result and the size and layout of the lecture room
where the lecture occurs. These virtual modules include the lecturer-tracking virtual
cinematographer, for tracking and filming the lecturer, and the audience-tracking virtual
cinematographer, for locating and filming audience members. The virtual modules also include
the visual presentation tracking virtual cinematographer that films and tracks (if necessary) any
visual presentations accompanying the lecture and the overview virtual cinematographer that
films the podium area and serves as a backup in case the other virtual cinematographers are not
ready or have failed. In addition, the virtual director selects the final video stream from the
camera system 305. The lecturer-tracking virtual cinematographer, audience-tracking virtual
cinematographer and the virtual director will now be discussed in detail.

[0053] Lecturer-Tracking Virtual Cinematographer 

[0054] The lecturer is a key object in the lecture. Accurately tracking and correctly framing the
lecturer therefore is of great importance. The lecturer-tracking virtual cinematographer follows
the lecturer's movement and gestures for a variety of camera views: close-ups to focus on
expression, medium shots for gestures, and long shots for context. In general, the lecturer-
tracking virtual cinematographer includes four components: a camera, a sensor, framing rules
and communication rules. A camera used in the lecturer-tracking virtual cinematographer is
preferably a pan/tilt/zoom camera or active camera. The active camera is controlled by a device
such as a computer.

[0055] Tracking a lecturer in a lecture room environment imposes both challenges and
opportunities. The present invention includes three implementations or configurations that may
be used to track the lecturer. Each of these implementations may be
used with one another or independently. Each of these implementations will now be discussed.

[0056] The first implementation for tracking a lecturer is also the simplest. This implementation
uses a single active camera as both the sensor and the camera. Even though simple, there are 
two potential problems: 

[0057] 1. Because frame-to-frame differencing motion tracking is used to avoid extra motion
compensation steps, it is necessary to stop the camera first before capturing the frames. The
stop-and-move mode results in unpleasant video capturing;

[0058] 2. Because tight shots of the lecturer are desired, the camera normally operates in the high 
zoom mode. The very narrow field-of-view (FOV) in the high zoom mode causes the camera to 
lose track of the lecturer quite easily. 

[0059] 

A second implementation increases the active camera's FOV by using a wide-angle camera 
as the sensor and attaching the wide-angle camera directly to the active camera. When the two 
cameras are attached and optical axes aligned, the lecturer-tracking virtual cinematographer 
uses the wide-angle camera to locate the target and then uses the target location to control the
active camera.



[0060] Because of its wide FOV, the wide-angle camera introduces a large radial distortion that is 
proportional to the distance from its frame center. Normally, this distortion must be corrected 
via camera intrinsic parameter estimation before the target location can be calculated. In this 
setup, however, because the two cameras move together, the target mostly will appear close to
the center of the frame and it is not necessary to conduct the correction. In other words, this
setup solves the problem of keeping the camera view tight on the lecturer but does not solve
the problem of the camera moving around too much.

[0061] A third implementation addresses the first problem by having the wide-angle camera not
move with the active camera. For this reason the wide-angle camera is mounted on a static
base right above the active camera and the wide-angle camera's optical axis is aligned to that
of the active camera's home position. This third implementation is preferred because both the
first and second problems are solved by this implementation. However, this implementation
does have a cost. Compared with the wide-angle camera attached to the active camera setup,
this implementation requires an extra camera calibration step because of the wide-angle
camera's distortion. In contrast to the wide-angle camera attached to the active camera setup, a
lecturer can appear anywhere in the wide-angle camera frame, including the highly distorted
regions of the wide-angle camera's boundaries. Techniques that unwarp the image are available
and well known to those having ordinary skill in the art.

[0062] Audience-Tracking Virtual Cinematographer 

[0063] Showing the audience members who are asking questions is important to make useful and 

interesting lecture videos. The present invention uses a sensing modality based on microphone 
arrays, where the audience-tracking virtual cinematographer first estimates the sound source 
direction using the microphones and then uses the estimation to control the active camera. This 
technique is called a sound source localization (SSL) technique. 

[0064] 

In general, three types of SSL techniques exist in the literature: (a) steered-beamformer- 
based; (b) high-resolution spectral-estimation-based; and (c) time-delay-of-arrival (TDOA) 
based. The first two types of techniques are computationally expensive and not suitable for 
real-time applications. The preferred technique is the TDOA-based technique, where the
measure in question is not the acoustic data received by the sensors, but rather the time delays
between each sensor.



[0065] Within various TDOA approaches the generalized cross-correlation (GCC) approach is one
of the most successful. Mathematically, the GCC approach may be described as follows. Let $s(n)$
be the source signal, and $x_1(n)$ and $x_2(n)$ be the signals received by the two microphones:

[0066]

$$
\begin{aligned}
x_1(n) &= a\,s(n - D) + h_1(n) * s(n) + n_1(n) \\
x_2(n) &= b\,s(n) + h_2(n) * s(n) + n_2(n)
\end{aligned}
\qquad (1)
$$

[0067] where $D$ is the TDOA, $a$ and $b$ are signal attenuations, $n_1(n)$ and $n_2(n)$ are the additive
noise, and $h_1(n)$ and $h_2(n)$ represent the reverberations. Assuming the signal and noise are
uncorrelated, $D$ can be estimated by finding the maximum GCC between $x_1(n)$ and $x_2(n)$:

[0068]

$$
\begin{aligned}
D &= \arg\max_{\tau} \hat{R}_{x_1 x_2}(\tau) \\
\hat{R}_{x_1 x_2}(\tau) &= \frac{1}{2\pi} \int_{-\pi}^{\pi} W(\omega)\, G_{x_1 x_2}(\omega)\, e^{j\omega\tau}\, d\omega
\end{aligned}
\qquad (2)
$$

[0069] where $\hat{R}_{x_1 x_2}(\tau)$ is the cross-correlation of $x_1(n)$ and $x_2(n)$, $G_{x_1 x_2}(\omega)$ is the Fourier
transform of $\hat{R}_{x_1 x_2}(\tau)$, i.e., the cross power spectrum, and $W(\omega)$ is the weighting function.
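
By way of illustration only, the delay estimate described above may be sketched in Python as follows. The sketch assumes integer sample delays and a circular (DFT-based) correlation, and it writes the transform out with the standard library for self-containment. The names `dft` and `gcc_delay` and the optional `weight` callback are hypothetical and do not appear in this specification.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (an FFT library is the usual choice)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * f * k / n) for k in range(n))
            for f in range(n)]


def gcc_delay(x1, x2, weight=None):
    """Estimate the delay D (in samples) of x1 relative to x2 by maximizing the GCC."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    # Cross power spectrum G(w) = X1(w) * conj(X2(w)).
    G = [a * b.conjugate() for a, b in zip(X1, X2)]
    # With no weighting supplied, W(w) = 1 and the GCC is the plain cross-correlation.
    W = [1.0] * n if weight is None else [weight(g) for g in G]
    best_tau, best_r = 0, -math.inf
    for tau in range(-(n // 2), n // 2):
        # Inverse transform of W(w) G(w), evaluated at lag tau.
        r = sum(w * g * cmath.exp(2j * math.pi * f * tau / n)
                for f, (w, g) in enumerate(zip(W, G))).real / n
        if r > best_r:
            best_tau, best_r = tau, r
    return best_tau
```

A weighting such as PHAT could be supplied as `weight=lambda g: 1.0 / (abs(g) + 1e-12)`, where the small constant is an assumed guard against division by zero.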

[0070] In practice, choosing the right weighting function is of great significance for achieving
accurate and robust time delay estimation. As can be seen from equation (1), there are two
types of noise in the system, i.e., the background noise $n_1(n)$ and $n_2(n)$ and reverberations
$h_1(n)$ and $h_2(n)$. Previous research suggests that the maximum likelihood (ML) weighting
function is robust to background noise and the phase transformation (PHAT) weighting function
is better at dealing with reverberations:



[0071]

$$
W_{ML}(\omega) = \frac{1}{|N(\omega)|^2}, \qquad
W_{PHAT}(\omega) = \frac{1}{|G_{x_1 x_2}(\omega)|}
$$

where $|N(\omega)|^2$ is the noise power spectrum.

It is easy to see that the above two weighting functions are at two extremes. In other words,
$W_{ML}(\omega)$ puts too much emphasis on "noiseless" frequencies, while $W_{PHAT}(\omega)$
treats all the frequencies equally. To simultaneously deal with background noise and
reverberations, a technique for use in the present invention has been developed as follows.
Initially, $W_{ML}(\omega)$, which is the optimum solution in non-reverberation conditions, is used. To
incorporate reverberations, a generalized noise is defined as follows:

[0074]

$$
|N'(\omega)|^2 = |H(\omega)|^2\,|S(\omega)|^2 + |N(\omega)|^2
$$

[0075] Assuming the reverberation energy is proportional to the signal energy, the following
weighting function is obtained:

$$
W(\omega) = \frac{1}{\gamma\,|G_{x_1 x_2}(\omega)| + (1 - \gamma)\,|N(\omega)|^2}
$$

where $\gamma \in [0, 1]$ is the proportion factor.
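
The relationship among the three weighting functions may be illustrated by the following minimal Python sketch, which evaluates each weight at a single frequency. The function names and the scalar-per-frequency interface are assumptions made for illustration. The sketch shows that the combined weight reduces to the PHAT weight when the proportion factor is 1 and to the ML weight when it is 0.

```python
def w_ml(noise_power):
    """ML weight at one frequency: 1 / |N(w)|^2."""
    return 1.0 / noise_power


def w_phat(cross_power_mag):
    """PHAT weight at one frequency: 1 / |G(w)|."""
    return 1.0 / cross_power_mag


def w_combined(cross_power_mag, noise_power, gamma):
    """Combined weight: 1 / (gamma |G(w)| + (1 - gamma) |N(w)|^2)."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return 1.0 / (gamma * cross_power_mag + (1.0 - gamma) * noise_power)
```

Intermediate values of the proportion factor trade robustness to background noise against robustness to reverberation.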

Once the time delay $D$ is estimated from the above procedure, the sound source direction
can be estimated given the microphone array's geometry. FIG. 5 is a geometric diagram
illustrating the sound source localization technique of the present invention. As shown in FIG.
5, the two microphones are at locations A and B, where AB is called the baseline of the
microphone array. Let the active camera be at location O, whose optical axis is perpendicular to
AB. The goal of SSL is to estimate the angle $\angle COX$ such that the active camera can point in
the right direction. When the distance of the target, i.e., |OC|, is much larger than the length of
the baseline |AB|, the angle $\angle COX$ can be estimated as follows:

$$
\angle COX \approx \angle BAD = \arcsin\frac{|BD|}{|AB|} = \arcsin\frac{D \cdot v}{|AB|} \qquad (3)
$$

where $D$ is the time delay and $v = 342$ m/s is the speed of sound traveling in air.

Several potential sensor-camera setups may be used to track audience members. First,
microphones may be attached to the camera. Just as a human head has two ears, two
microphones may be attached to the left and right sides of the active camera. In this
configuration, an estimate is made to determine whether the sound source is from the left or
from the right, and there is no need to know the exact direction. This configuration, however,
has two major problems in the lecture room context.

One problem is that the camera's width is not wide enough to be a good baseline for the 
microphone array. As can be seen from Equation (3), the SSL resolution is inversely proportional 
to the length of the baseline. A small baseline will result in poor resolution. A solution to this 
problem is to attach an extension structure to the camera and then attach the microphones to 
that structure to extend the microphone array's baseline. However, this solution leads to the
second problem of this configuration: distraction. Local audience members do not want to see
moving objects that may distract their attention. That is why in most lecture rooms the
tracking cameras are hidden inside a dark dome. In this configuration, however, since the
microphones are attached to the active camera, the whole tracking unit has to be outside the 
dome in order for the microphones to hear. By extending the baseline of the microphone array, 
we will increase the distraction factor as well. The distraction factor of such a setup makes it 
unusable in real lecture rooms. 
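
The effect of the baseline length on resolution may be made concrete with the following minimal Python sketch of Equation (3). The function names are hypothetical, and the sampling rate passed to `angular_step_deg` is an assumed parameter that this specification does not state. The sketch computes the source angle from a time delay and the angle change that corresponds to a one-sample change in delay near the optical axis.

```python
import math

SPEED_OF_SOUND_M_S = 342.0  # speed of sound used in Equation (3)


def source_angle_deg(delay_s, baseline_m, v=SPEED_OF_SOUND_M_S):
    """Equation (3): far-field source angle in degrees from delay and baseline."""
    ratio = delay_s * v / baseline_m
    # Estimation noise can push the ratio just outside [-1, 1]; clamp before arcsin.
    ratio = max(-1.0, min(1.0, ratio))
    return math.degrees(math.asin(ratio))


def angular_step_deg(baseline_m, sample_rate_hz, v=SPEED_OF_SOUND_M_S):
    """Angle change near the optical axis for a one-sample change in delay."""
    return source_angle_deg(1.0 / sample_rate_hz, baseline_m, v)
```

For any fixed sampling rate, `angular_step_deg` shrinks as the baseline grows, which is why a camera-width baseline gives coarser direction estimates than a longer one.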

An alternative solution is to have static microphones and a moving camera. In this 
configuration the microphone array is detached from the active camera, but the microphone 
array's baseline is kept perpendicular to the camera's optical axis to ensure easy coordinate 
system transformation. By separating the microphones from the camera, we have a more 
flexible configuration. For example, the camera can be hidden inside a dark dome above the 
microphone array. In addition, because the microphone array is static, we can have a much 
larger baseline without causing any movement distraction. An example of a preferred baseline 
is 22.5 cm. 

Virtual Director 

The responsibility of the virtual director is to gather and analyze reports from different
virtual cinematographers, to make intelligent decisions on which camera view to select as the
current (or output) camera view, and to control the mixer to generate the final video output.
Just like human video directors, a good virtual director observes the rules of cinematography
and video editing in order to make the lecture video more informative and entertaining. These
expert video production rules are discussed in detail below.



[0086] IV. Expert Video Production Rules 

[0087] Current automated video production techniques produce a video by focusing on technology 
rather than on what an audience prefers to see. Thus, these current techniques lack a 
systematic study on expert video production rules. Expert video production rules are those 
rules used by video production professionals to produce high-quality videos. These expert 
video production rules provide a set of constraints for the video production process. For 
example, there are rules used by video professionals on how to frame a speaker (such as a 
lecturer) and how to transition from one camera shot to the next. Following these rules makes
it more likely that the video will be of high quality and that a viewer will enjoy the viewing
experience. For example, one expert video production rule states that each shot should be
longer than a minimum duration. Violating this rule produces a video that quickly jumps from
one camera shot to the next and is highly distracting to a viewer. If a video were produced that
continuously violated this rule, the video may be so distracting that the viewer simply stops
watching the video or may miss valuable information presented therein.

[0088] In general, the expert video production rules used by the present invention may be
obtained from a variety of sources. By way of example, these rules may be obtained from a
textbook, an interview with a video professional, a technical paper, or by observing the work of 
a video professional. A set of expert video production rules is generated by obtaining 
information from many sources (such as interviews with several video professionals) and 
including the most common rules within the set. If certain rules conflict they either can be 
excluded or the rules having the most weight can be included. By way of example, a rule may 
be considered to have more weight if the rule was from a video professional who has a great 
deal of experience or if the rule was obtained from a classic textbook on video production. 
These expert video production rules may also be implemented in a system whereby the set of 
expert video production rules may be expanded. This expansion may occur through the use of 
machine learning techniques (such as neural networks). Thus, the set of expert video 
production rules may be changed as the automated video production system "learns" new rules. 

[0089] In a preferred embodiment, the present invention uses expert video production rules
obtained from discussions with seven professional video producers. These rules are categorized
into three types of rules: (1) camera setup rule, dictating how the cameras should
be set up; (2) lecturer framing rules, dictating how to frame a lecturer; and (3) editing (or
virtual director) rules, including rules dictating how to edit the video inputs and how to choose
a camera output. These expert video production rules may be summarized as follows.

Camera Setup Rule 

The camera setup rule governs the set up of the cameras used to capture a lecture. This 
rule prescribes how cameras should be set up to provide the best possible finished video 
product. The camera setup rule is based on a "line of interest". In video production, especially 
in filmmaking, there is a "line of interest". This line of interest can be a line linking two people, 
a line a person is moving along, or a line a person is facing. The camera setup rule (or the "line 
of interest" rule) states: 

1. Do not cross the line of interest.

For example, if an initial shot is taken from the left side of the line, subsequent shots 
should be taken from that side. This rule will ensure that a moving person maintains the 
direction of apparent motion. This rule can only be violated when a neutral shot is used to 
make the transition from one side of the line to the other. Following this camera setup rule 
when setting up the cameras used to capture a lecture helps to ensure success. 

Referring to FIG. 1, the lecturer 110 normally moves behind the podium 115 and in front of
the screen 140. In this lecture room environment, when the object of interest is the lecturer
110, the "line of interest" is the line that the lecturer 110 is moving along: a line behind the
podium 115 and in front of the screen 140. The camera setup shown in FIG. 1 satisfies the "line
of interest" rule of not crossing this line. When the object of interest is the audience, the line of 
interest is the line linking the lecturer and the audience. The camera setup shown in FIG. 1 
satisfies the rule in this case as well. 

Lecturer Framing Rules 

The lecturer framing rules prescribe how to frame a lecturer. The lecturer is the most 
important object in a lecture. Thus, correctly framing the lecturer is of great importance. In this 
preferred embodiment, the rules for framing the lecturer state: 

2. Do not move the lecturer-tracking camera too often; only move it when the lecturer
moves outside a specified zone.
[0098] 3. Frame the lecturer so that there is half-a-head of room above the lecturer's head. 
[0099] Virtual Director Rules 

[0100] The virtual director rules are the rules that prescribe how video inputs should be edited and
how a current camera view should be selected from the multiple camera views provided by the
camera system. The following rules govern what a virtual director should do with the multiple
camera views sent from multiple virtual cinematographers.

[0101] 1. Establish the shot first. In lecture filming, it is good to start with an overview shot such
that remote audiences get a global context of the environment.

[0102] 2. Do not make jump cuts: when transitioning from one shot to another, the view and
number of people should be significantly different. Failing to do so will generate a jerky and
sloppy effect.

[0103] 3. Do not cut to a camera that is too dark. This will ensure better final video quality.

[0104] 4. Each camera shot should be longer than a minimum duration $D_{min}$ (preferably
approximately four seconds). Violating this rule is distracting for the remote audience.

[0105] 5. Each shot should be shorter than a maximum duration $D_{max}$. Violating this rule makes
the video boring to watch. The value of $D_{max}$ is different depending on which camera is used.

[0106] 6. When all other cameras fail, switch to safe back-up cameras (which in this preferred
embodiment is the overview camera).

[0107] 7. When a person in the audience asks a question, promptly show that person. This is 
important for remote audience members to follow the lecture. 

[0108] 8. Occasionally, show local audience members for a period of time (e.g., 5 seconds) even if 
no one asks a question. This will make the final video more interesting to watch. 

[0109] The first six rules are generic while the last two specifically deal with how to properly select
audience shots. For rule 1, the present invention starts with an overview camera shot to
establish the lecture context. For rule 2, the camera setup of the present invention as shown in
FIG. 1 ensures that there are no "jump cuts" because the views of the cameras are significantly
different from each other. For rule 3, the gain control of each camera in a lecture room must be
carefully calibrated such that shots meet the brightness requirement. As for rules 4 through 8,
the present invention includes an audience camera and the virtual director of the present
invention uses these rules to select a current camera view.
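
By way of example only, the duration and back-up rules (rules 4 through 6) might be applied by a selection routine such as the following Python sketch. The `CameraReport` structure, the `choose_camera` function, the camera names and the per-camera maximum durations are all hypothetical. Only the four-second minimum duration is taken from the rules above, and rules 7 and 8 are not modeled.

```python
from dataclasses import dataclass

MIN_SHOT_S = 4.0  # rule 4: preferred minimum shot duration


@dataclass
class CameraReport:
    name: str
    ready: bool
    max_shot_s: float  # rule 5: assumed per-camera maximum shot duration


def choose_camera(current, elapsed_s, reports, backup="overview"):
    """Return the camera to show next, given the current shot and its age in seconds."""
    ready = {r.name: r for r in reports if r.ready}
    cur = ready.get(current)
    if cur is not None:
        if elapsed_s < MIN_SHOT_S:
            return current  # rule 4: the shot is still too short to cut away
        if elapsed_s < cur.max_shot_s:
            return current  # rule 5 has not been triggered yet
    # The current shot has run too long or its camera failed: cut to another ready camera.
    for name in ready:
        if name != current:
            return name
    return backup  # rule 6: nothing else is available
```

In this sketch a camera that is no longer ready is abandoned immediately, even before the minimum duration has elapsed. That ordering is a design assumption rather than something the rules prescribe.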

[0110] V. Operation of the Invention

[0111] FIG. 6 is a detailed block diagram illustrating the interaction between the audience-tracking
cinematographer 310, the lecturer-tracking cinematographer 300, the virtual director 340 and
the mixer 350 shown in FIG. 4. The audience-tracking cinematographer 310 is in
communication with the audience-tracking camera 125 for providing images of audience
members and includes a microphone array audience tracker 600 that controls the audience-
tracking camera 125. The audience tracker 600 uses a microphone array (not shown) and
expert video production rules 610 (such as those discussed above) to control the audience-
tracking camera 125 and frame each camera view. An audience-tracking status module 620
provides status information to the virtual director 340.

[0112] Similarly, the lecturer-tracking cinematographer 300 is in communication with the lecturer-
tracking camera 130 for providing images of the lecturer and includes a history-based,
reduced-motion lecturer tracker 630 that controls the lecturer-tracking camera 130. The
expert video production rules 610 are used by the lecturer tracker 630 to control the lecturer-
tracking camera 130 and a lecturer-tracking status module 650 reports the status (such as
"ready" or "not ready") of the lecturer-tracking camera 130 to the virtual director 340.

[0113] The lecturer-tracking cinematographer 300 includes the history-based, reduced-motion
tracker 630. This tracker 630 operates by zooming in or out on a subject (such as a lecturer or
audience member) depending on the history of the subject's movement. For example, if a
lecturer has a history of frequently changing locations, the tracker 630 takes this into account
and sets an appropriate camera zoom to capture the lecturer. Once the camera zoom is set, the
tracker 630 generally does not move the camera (either by panning left and right or by zooming
in and out) but remains fixed. Thus, based on the history of movement of the subject, the
tracker 630 sets up a camera view and remains fixed until the subject moves out of camera
view. Once this happens, the virtual director 340 generally assigns another camera as the
output camera. The history-based, reduced-motion lecturer tracker 630 of the present invention
ensures that the lecturer-tracking camera 130 is not continually zooming in or out or panning
left and right and thus reduces distractions to the viewer.

[0114] In general, under the history-based reduced-motion feature of the present invention, the
camera does not move once it locks and focuses on the lecturer. Camera movement only occurs
if the lecturer moves out of the frame or if the virtual director switches to a different camera. In
particular, let $(x_t, y_t)$ be the location of the lecturer estimated from the wide-angle camera.
According to the history-based reduced-motion feature of the present invention, before the
virtual director cuts to the lecturer-tracking camera at time $t$, the lecturer-tracking virtual
cinematographer will pan/tilt the camera such that it locks and focuses on location $(x_t, y_t)$. To
determine the zoom level of the camera, the lecturer-tracking virtual cinematographer maintains
the trajectory of lecturer location in the past $T$ seconds,
$(X, Y) = \{(x_1, y_1), \ldots, (x_t, y_t), \ldots, (x_T, y_T)\}$. Currently, $T$ is set to 10 seconds. The
bounding box of the activity area in the past $T$ seconds is then given by a rectangle
$(X_L, Y_T, X_R, Y_B)$, whose coordinates are the left-most, top-most, right-most, and bottom-most
points in the set $(X, Y)$. If it is assumed that the lecturer's movement is piece-wise stationary,
then $(X_L, Y_T, X_R, Y_B)$ is a good estimate of where the lecturer will be in the next $T$ seconds.
The zoom level is calculated as follows:

[0115]

$$
Z_L = \min\left(\frac{HFOV}{\angle(X_R, X_L)},\; \frac{VFOV}{\angle(Y_B, Y_T)}\right) \qquad (4)
$$

[0116] where $HFOV$ and $VFOV$ are the horizontal and vertical fields of view of the active camera,
and $\angle(\cdot,\cdot)$ represents the angle spanned by the two arguments in the active camera's
coordinate system. The history-based reduced-motion feature of the present invention allows
the lecturer-tracking virtual cinematographer to control the active camera in such a way as to
reduce camera movement while also maintaining the tightest camera views possible.
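
The zoom calculation of Equation (4) may be sketched in Python as follows. The sketch assumes that the lecturer locations have already been converted to pan and tilt angles (in degrees) in the active camera's coordinate system, so that the angle spanned by two points is simply their difference. The function name and the `max_zoom` cap for a stationary lecturer are hypothetical.

```python
def zoom_level(trajectory, hfov_deg, vfov_deg, max_zoom=10.0):
    """Return the zoom factor of Equation (4) for a list of (x, y) angular positions."""
    xs = [x for x, _ in trajectory]
    ys = [y for _, y in trajectory]
    h_span = max(xs) - min(xs)  # angle spanned by X_R and X_L
    v_span = max(ys) - min(ys)  # angle spanned by Y_B and Y_T
    # A lecturer who has not moved spans zero degrees; cap rather than divide by zero.
    h_zoom = hfov_deg / h_span if h_span > 0 else max_zoom
    v_zoom = vfov_deg / v_span if v_span > 0 else max_zoom
    return min(h_zoom, v_zoom, max_zoom)
```

Taking the minimum of the horizontal and vertical ratios selects the tightest zoom at which the whole activity area still fits in the frame.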

[0117] The microphone array audience tracker 600 uses a microphone array-based technique to
track audience members who are talking. In one embodiment, the type of microphone array- 
based approach used is the sound source localization (SSL) technique described in detail above. 
The SSL approach uses correlation techniques to find the time difference between an audio 
signal arriving at two microphones. From the time difference and microphone array's geometry, 
the sound source location can be estimated. 

[0118] Communication Between Virtual Cinematographers and the Virtual Director

[0119] Each of the cameras reports a status to the virtual director 340. Each virtual
cinematographer periodically reports its status $S_T$, camera zoom level $Z_L$, and tracking
confidence level $C_L$ to the virtual director. The virtual director collects all the $S_T$, $Z_L$, and $C_L$
from the virtual cinematographers. Based on the collected information and history data, the
virtual director then decides which camera is chosen as the final video output and switches the
mixer to that camera. The virtual director also sends its decision $D$ back to the virtual
cinematographers to coordinate further cooperation.

[0120] When the lecturer-tracking cinematographer reports to the virtual director, $S_T$ takes on
values of {READY, NOTREADY}. $S_T$ is READY when the camera locks and focuses on the target.
$S_T$ is NOTREADY when the camera is still in motion or the target is out of the view of the camera.
$Z_L$ is computed by using Equation (4) and normalized to the range of [0,1]. $C_L$ is 1 if the target
is inside the active camera's FOV and is 0 otherwise. It should be noted that the statuses
{READY, NOTREADY} are just examples. There can be much more complex statuses, such as
trying to focus, performing panning, done static, and so forth.

[0121] Communication between the audience-tracking virtual cinematographer and the virtual
director occurs as follows. Once the microphone array detects a sound source and estimates
the sound source direction with enough confidence, the audience-tracking virtual
cinematographer will pan the active camera to that direction. Status $S_T$ is READY when the
active camera locks on the target and stops moving. $S_T$ is NOTREADY when the active camera
is still panning toward the sound source direction.

[0122] To obtain a good estimate of the confidence level $C_L$, a natural quantity associated with
GCC is used. Namely, the correlation coefficient, $\rho$, represents how correlated two signals
are, and thus represents how confident the TDOA estimate is. Its value always lies in the range
of [-1, +1]. $C_L$ can then be computed as follows:

[0123]

$$
C_L = (\rho + 1)/2, \qquad
\rho = \frac{\hat{R}_{x_1 x_2}(\tau)}{\sqrt{R_{x_1 x_1}(0)\, R_{x_2 x_2}(0)}}
$$

[0124] where $\hat{R}_{x_1 x_2}(\tau)$ is defined in Equation (2).
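
The mapping from correlation coefficient to confidence level may be illustrated with the following Python sketch. As a simplifying assumption, the sketch uses an unweighted time-domain correlation at the estimated lag in place of the weighted GCC of Equation (2). The function name and the handling of silent input are hypothetical.

```python
import math


def tdoa_confidence(x1, x2, delay):
    """Map the correlation coefficient at the estimated delay to a confidence in [0, 1]."""
    # Cross-correlation at the estimated lag: sum of x1[i + delay] * x2[i].
    r12 = sum(x1[i + delay] * x2[i]
              for i in range(len(x2)) if 0 <= i + delay < len(x1))
    r11 = sum(v * v for v in x1)  # zero-lag autocorrelation of x1
    r22 = sum(v * v for v in x2)  # zero-lag autocorrelation of x2
    if r11 == 0 or r22 == 0:
        return 0.5  # assumption: treat silence as uncorrelated (rho = 0)
    rho = r12 / math.sqrt(r11 * r22)
    return (rho + 1.0) / 2.0
```

A perfectly aligned copy of the signal yields a confidence of 1, while an inverted copy yields 0.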
[0125] In addition to promptly showing the talking audience members, an important framing rule
for the audience-tracking virtual cinematographer is to have the ability to show a general shot
of the audience members even when none of them is talking. This added ability makes the
recorded lecture more interesting and useful to watch. These types of camera views are
normally composed of a slow pan from one side of the lecture room to the other. To support
this framing rule, the audience-tracking virtual cinematographer's status S takes an extra
value of {GENERAL} in addition to {READY, NOTREADY}, and the virtual director's decision D
takes an extra value of {PAN} in addition to {LIVE, IDLE}. Status S equals {GENERAL} when the
camera is not moving and the microphone array does not detect any sound source. After
the virtual director receives status S = {GENERAL} from the audience-tracking virtual
cinematographer, it decides whether to cut to the audience-tracking virtual
cinematographer's camera. If so, the virtual director sends decision D = {PAN} to the
audience-tracking virtual cinematographer. Upon receiving this decision, the audience-tracking
virtual cinematographer slowly pans its camera from one side of the lecture room to the other.
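
The status and decision values described for the audience-tracking cinematographer can be sketched in Python as follows. This is a simplified illustration, not the described implementation: the names `Status`, `Decision`, `audience_status` and `reply_to_general` are hypothetical, and the sketch assumes that camera motion and sound detection are available as simple boolean inputs.

```python
from enum import Enum
from typing import Optional


class Status(Enum):
    READY = "READY"
    NOTREADY = "NOTREADY"
    GENERAL = "GENERAL"


class Decision(Enum):
    LIVE = "LIVE"
    IDLE = "IDLE"
    PAN = "PAN"


def audience_status(camera_moving: bool, sound_detected: bool) -> Status:
    """Derive the audience-tracking status from two assumed boolean inputs."""
    if camera_moving:
        # Still panning toward the sound source direction.
        return Status.NOTREADY
    if sound_detected:
        # Camera has stopped and is locked on the talking audience member.
        return Status.READY
    # Camera is still and the microphone array detects no sound source.
    return Status.GENERAL


def reply_to_general(status: Status, director_wants_cut: bool) -> Optional[Decision]:
    """Return PAN when the director chooses to cut to a GENERAL audience shot.

    How the director makes that choice is outside this sketch, so it is
    passed in as a flag; None means no PAN decision is sent.
    """
    if status is Status.GENERAL and director_wants_cut:
        return Decision.PAN
    return None
```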

[0126] Referring again to FIG. 6, the audience-tracking cinematographer 310 and the
lecturer-tracking cinematographer 300 send data to the mixer 350 and to the virtual director 340.
The mixer 350 receives video data from the cinematographers 300, 310 and the overview camera
135. The virtual director 340 receives status and control data from the two cinematographers
300, 310 and the overview camera 135. Based on this status and control data, the virtual
director 340 determines which of the cameras provides the output video. One purpose of the virtual
director 340 is to gather and analyze reports from the cinematographers 300, 310 and to
control the mixer 350 to generate the final video based on expert video production editing
rules. The virtual director 340 uses two important components to achieve this goal: a status
vector module 660 to maintain the status of each cinematographer 300, 310 and an event
generator 670. The event generator 670 includes a module that generates a triggering event
that triggers a switch from one camera view to another and a finite state machine (FSM) that
decides to which camera view to switch.

[0127] The event generator 670 includes an internal timer 675 to keep track of how long a
particular camera has been on, and a report vector R = {S, Z, C} to maintain each virtual
cinematographer's status S, zoom level Z and confidence level C. The event generator
670 is capable of generating two types of triggering events that cause the virtual director
module 340 to switch cameras. One type of event is called a "status change" event. A status
change event occurs when a status of a cinematographer changes. The second type of event is
called a "time expire" event. The time expire event occurs if a camera has been on for longer
than a predetermined amount of time, as determined by the timer 675. Both the status change
triggering event and the time expire triggering event are discussed in detail below.

[0128] The status vector module 660 processes status information received from the
cinematographers 300, 310. The status vector module 660 maintains the status element, S,
of the report vector R = {S, Z, C}. The status element keeps track of the status of each
cinematographer. By way of example, if there are three cameras in the system the report vector
has three elements. These three elements represent the current information from the
lecturer-tracking cinematographer 300, the audience-tracking cinematographer 310 and the overview
camera 135. In this example, the status element of the lecturer-tracking cinematographer 300,
S[1], takes two values, {READY, NOTREADY}. The status element of the audience-tracking
cinematographer 310, S[2], takes three values, {READY, NOTREADY, GENERAL}. Because the
overview camera is a safe back-up camera, its status element, S[3], takes only one value,
{READY}. Together, these status elements represent a combination of 2 × 3 × 1 = 6 overall statuses
for the entire system. It should be noted that this example is merely illustrative and more or
fewer values may be used for the status element.
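
The count of six overall statuses in the three-camera example can be checked by enumerating every combination of the per-camera status values. The following Python sketch is illustrative only; the tuple names and the function `overall_statuses` are hypothetical.

```python
from itertools import product

# Status values per camera, taken from the three-camera example in the text.
LECTURER_STATUSES = ("READY", "NOTREADY")
AUDIENCE_STATUSES = ("READY", "NOTREADY", "GENERAL")
OVERVIEW_STATUSES = ("READY",)


def overall_statuses():
    """Enumerate every (lecturer, audience, overview) status combination."""
    return list(product(LECTURER_STATUSES, AUDIENCE_STATUSES, OVERVIEW_STATUSES))
```

Calling `overall_statuses()` yields six tuples, one for each overall system status, for example `("READY", "GENERAL", "READY")`.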

[0129] The event generator 670 constantly monitors the report vector R = {S, Z, C}. If any
cinematographer reports a status change, then a status change event is generated. The virtual
director module 340 then takes action in response to this status change event. For example, one
response by the virtual director module 340 to a status change event is to switch to a
different camera.
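
One way to picture status change detection is a monitor that remembers the last status reported by each cinematographer and signals an event when a new report differs. The Python sketch below is a minimal illustration under that assumption; the `Report` and `StatusChangeMonitor` names are hypothetical and do not come from the described system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    """One cinematographer's entry in the report vector R = {S, Z, C}."""
    status: str
    zoom: float
    confidence: float


class StatusChangeMonitor:
    """Tracks the last reported status per cinematographer (illustrative)."""

    def __init__(self, initial_reports: List[Report]) -> None:
        self._reports = list(initial_reports)

    def update(self, index: int, report: Report) -> bool:
        """Store a new report and return True if its status changed."""
        changed = report.status != self._reports[index].status
        self._reports[index] = report
        return changed
```

A caller would treat a `True` return value as a status change event and pass it to whatever logic selects the next camera.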

[0130] The event generator 670 determines which of the multiple camera views provided by the
camera system should be the current camera view. The event generator 670 includes a time
transition module 680, a location transition module 685 and the expert video production rules
610. In general, the event generator 670 determines transitions from one camera view to
another camera view based on a triggering event. The time transition module 680 determines
when a transition should occur and generates a time expire triggering event. The location
transition module 685 determines to which camera the transition should proceed. Both the time
transition module 680 and the location transition module 685 follow the expert video
production rules 610 when determining when and where to transition.
[0131] The time transition module 680 generates a time expire event as follows. In video
production, switching from one camera to another is called a cut. The period between two cuts
is called a video shot. An important video editing rule is that a shot should not be too long or
too short. To enforce this rule, each camera has its minimum shot duration D_MIN and its
maximum allowable duration D_MAX. If a shot length is less than D_MIN, then no switching
between cameras will occur. On the other hand, if a camera has been on longer than its D_MAX,
a time expire event will be generated. By way of example, D_MIN may be set to approximately 5
seconds for all cameras based on the suggestions of professional video producers.
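
The minimum and maximum shot durations described above can be pictured as two simple checks on the length of the current shot. The Python sketch below is illustrative only: the function names `may_switch` and `timer_event` and the event label `"TIME_EXPIRE"` are hypothetical, and the 5-second default is the example value given in the text.

```python
from typing import Optional

D_MIN_SECONDS = 5.0  # example minimum shot duration from the text


def may_switch(shot_length: float, d_min: float = D_MIN_SECONDS) -> bool:
    """A cut is allowed only once the current shot has lasted at least d_min."""
    return shot_length >= d_min


def timer_event(shot_length: float, d_max: float) -> Optional[str]:
    """Return "TIME_EXPIRE" once the shot has run longer than d_max, else None."""
    return "TIME_EXPIRE" if shot_length > d_max else None
```

For instance, a shot that has run for 3 seconds cannot yet be cut, while a shot that has run 61 seconds on a camera whose maximum is 60 seconds produces a time expire event.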

[0132] Two factors affect the length of a shot, D_MAX. One factor is the nature of the shot and the
other factor is the quality of the shot. The nature of the shot determines a base duration D_BASE for
each camera. Lecturer-tracking shots generally are longer than overview shots, because they
generally are more interesting. By way of example, exemplary values for D_BASE are 60 seconds
for the lecturer-tracking camera when S = READY, 10 seconds for the audience-tracking
camera when S = READY, 5 seconds for the audience-tracking camera when S = GENERAL,
and 40 seconds for the overview camera when S = READY.

[0133] The quality of a shot is defined as a weighted combination of the camera zoom level Z
and the tracking confidence level C. Quality of the shot affects the value of D_MAX such that
high-quality shots should last longer than low-quality shots. Thus, the final D_MAX is a product
of the base length D_BASE and the shot quality:

[0134] $$D_{MAX} = D_{BASE} \times (\alpha Z + (1 - \alpha) C)$$

[0135] where α is chosen experimentally. By way of example, an exemplary value for α is
α = 0.4.
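
As a worked illustration of the maximum shot duration, the following Python sketch multiplies a base duration by the weighted shot quality. It is only a sketch under stated assumptions: the dictionary keys, the constant names and the functions `shot_quality` and `max_shot_duration` are hypothetical, while the base durations and the weight of 0.4 are the example values given in the text.

```python
# Example base durations in seconds, keyed by (camera, status); values from the text.
BASE_DURATION_SECONDS = {
    ("lecturer", "READY"): 60.0,
    ("audience", "READY"): 10.0,
    ("audience", "GENERAL"): 5.0,
    ("overview", "READY"): 40.0,
}

ALPHA = 0.4  # example weight from the text, chosen experimentally


def shot_quality(zoom: float, confidence: float, alpha: float = ALPHA) -> float:
    """Weighted combination of zoom level Z and confidence level C, both in [0, 1]."""
    return alpha * zoom + (1.0 - alpha) * confidence


def max_shot_duration(camera: str, status: str, zoom: float, confidence: float,
                      alpha: float = ALPHA) -> float:
    """Base duration scaled by shot quality, in seconds."""
    base = BASE_DURATION_SECONDS[(camera, status)]
    return base * shot_quality(zoom, confidence, alpha)
```

For a lecturer-tracking shot with Z = 0.5 and C = 1, the quality is 0.4 × 0.5 + 0.6 × 1 = 0.8, so the maximum duration is about 60 × 0.8 = 48 seconds.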

[0136] Upon receiving a triggering event, the virtual director uses the location transition module
685, which in this preferred embodiment is a multiple-state probabilistic finite state machine
(FSM), to determine to which camera to switch.

[0137] By way of example, FIG. 7 illustrates a three-state probabilistic finite state machine 700. In
this embodiment, the three-state probabilistic finite state machine 700 includes a
lecturer-tracking camera state 710, an audience-tracking camera state 720, and an overview camera
state 730. The three states are fully connected to allow any
transition from one state to another. When to transition is governed by the triggering events
described above. Where to transition is determined by the transition probabilities of the finite
state machine (FSM).

[0138] The transition probabilities of the FSM use the expert video production rules 610. As an
example, one expert video production rule states that a cut should be made more often from
the lecturer-tracking camera to the overview camera than to the audience-tracking camera.
This rule is used by the FSM by making the probability of a transition from the lecturer-tracking
camera state to the overview camera state higher than that of a transition to the
audience-tracking camera state. At a microscopic level, each camera
transition is random, resulting in interesting video editing effects. At a macroscopic level, some
transitions are more likely to happen than others, obeying the expert video production rules.
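
A probabilistic finite state machine of this kind can be sketched in Python by sampling the next camera from a table of transition probabilities. This is an illustrative sketch only: the probability values below are invented placeholders, not values from the described system, and the names `TRANSITIONS` and `next_camera` are hypothetical. The sketch also assumes that a triggering event always moves to a different camera.

```python
import random

# Hypothetical transition probabilities; only the ordering for the lecturer
# state (overview more likely than audience) reflects the rule in the text.
TRANSITIONS = {
    "lecturer": {"overview": 0.7, "audience": 0.3},
    "audience": {"lecturer": 0.8, "overview": 0.2},
    "overview": {"lecturer": 0.8, "audience": 0.2},
}


def next_camera(current: str, rng: random.Random) -> str:
    """Sample the next camera state given the current one."""
    targets = list(TRANSITIONS[current])
    weights = [TRANSITIONS[current][target] for target in targets]
    # random.Random.choices draws with the given relative weights.
    return rng.choices(targets, weights=weights, k=1)[0]
```

Each individual call is random, but over many triggering events the cuts from the lecturer-tracking camera go to the overview camera more often than to the audience-tracking camera, which mirrors the microscopic and macroscopic behavior described above.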

[0139] VI. Working Example and User Study Results

[0140] The following discussion presents a working example of an implementation of the
automated video production system and method discussed above. This working example is
provided for illustrative purposes and is one of several ways in which the present invention may
be implemented. In addition, results from a user study are presented. This user study had two
objectives. First, the study sought to evaluate how much each individual expert video
production rule affected a remote audience's viewing experience. Second, the study sought to
compare the overall video quality of the automated camera management module to that of a
human camera operator. The human operator used in this study was a professional
camera operator with many years of experience in photo and video editing.

[0141] The automated video production of the present invention was deployed in a lecture room,
similar to the lecture room shown in FIG. 1. In this working example, there were three cameras
and the projection device in the lecture room. This lecture room is used on a regular basis to
record lectures, broadcast them live to employees at their desktops, and archive them
for on-demand viewing. The human camera operator used in this study is experienced in using
the three cameras and projection device in this lecture room to capture lectures.

[0142] In order to make a fair comparison between the present invention and the human camera
operator, the lecture room was restructured so that both the human operator and the present
invention used three cameras and the projection device. Both the human operator and the
present invention used the same static overview camera and projection device camera, while
having separate lecturer-tracking cameras and separate audience-tracking cameras that were
placed at close-by locations. The present invention and human operator also used independent
mixers.

[0143] Two studies were conducted: the first was a field study using a major organization's
employees, while the second was a lab study with participants recruited from local
colleges. For the field study, four lectures were used, including three regular technical lectures
and a fourth general-topic lecture on skydiving held specifically for this study. The skydiving
lecture was also used for the lab study. In the first study, a total of 24 employees watched one
of the four lectures live from their desktops in the same way they would watch any other
lecture. While providing a realistic test of the present invention, this study lacked a controlled
environment, and the remote audience members may have watched the lecture while doing
other tasks like reading e-mail or surfing the web. The second study was a more controlled
lab study that used eight students from local colleges. College students
were recruited because of their experience in watching lectures in their day-to-day life.

[0144] In both studies, the remote audience graphical user interface 800 shown in FIG. 8 was used.
The left portion of the interface 800 includes a MediaPlayer window 810, which is manufactured
by Microsoft Corporation located in Redmond, Washington. The window 810 displays the video
edited by the virtual director 340 of the present invention. In other words, the outputs of the
lecturer-tracking camera 130, the audience-tracking camera 125 and the overview camera 135
are first edited by the virtual director 340 and then displayed in the window 810. Under the
window 810 are controls 820 for controlling the video within the window, including "play",
"stop" and "pause" buttons.

[0145] The right portion of the interface 800 includes a slide window 830 that displays lecture
slides that are synchronized with the video. The output of the visual presentation tracking
camera 145 is displayed directly in the slide window 830. In an alternate embodiment, the slide
window 830 is eliminated and the output from the visual presentation tracking camera 145 is
integrated into the window 810. In the example shown in FIG. 8, the slide includes a first
picture 840 and a second picture 850 along with text. A lecture control bar 860 contains
controls (such as controls that allow a lecturer biography to be presented) for controlling
information regarding the lecture and a slide control bar 870 contains controls (such as
previous slide or next slide) that allow user control of the slides presented in the slide window
830.

[0146] All four lectures for the study were captured simultaneously by the human camera operator
and the automated camera management module of the present invention. When study
participants watched a lecture, the version captured by the human operator and the version
captured by the automated camera management module alternated in the
MediaPlayer window of the remote audience graphical user interface. For the three 1.5-hour
regular lectures, the two versions alternated every 15 minutes. For the half-hour skydiving
lecture, the two versions alternated every 5 minutes. The version presented first was
randomized. After watching each lecture, study participants provided feedback using a survey.

[0147] The survey was intended to test how well the present invention performed compared to the
human camera operator. Performance was measured using questions based on each of the
expert video production rules as well as two Turing test questions that evaluate whether study
participants could correctly determine which video was produced by a person as opposed to the
automated video production system and method of the present invention.

[0148] Results from the survey show that there is a general trend whereby a human operator is
rated slightly higher than the automated system of the present invention. To push the
comparison to an extreme, at the end of the survey study participants were asked a simple Turing
test: "do you think each camera operator is a human or computer?" The results clearly showed that study
participants could not determine which system was the computer and which system was the human
at any rate better than chance. For these particular lectures and study participants, the present
invention passed the Turing test.

[0149] The data clearly show that study participants could not determine which system was the
automated video production system and method and which was the human camera operator at
any rate better than chance. There are at least two implications from this study. First, the
present invention appears not to repeatedly make any obvious mistakes that the study
participants can notice. Second, many study participants probably realize that even human
camera operators make mistakes. For example, a human camera operator may sometimes be
tired, distracted, or plain bored by the lecturer and lecture content.
[0150] The foregoing description of preferred embodiments of the invention has been presented
for the purposes of illustration and description. It is not intended to be exhaustive or to limit
the invention to the precise form disclosed. Many modifications and variations are possible in
light of the above teaching. It is intended that the scope of the invention be limited not by this
detailed description of the invention, but rather by the claims appended hereto.